Summarising Data

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Learning Outcomes

  • Understand the concept of data, observations and variables
  • How to visualise numeric and categorical variables
  • How to summarise numeric and categorical variables

Data

In statistical parlance, data is a plural word referring to a collection of numbers or other pieces of information to which meaning has been attached

— Utts (2014)

Observations, values, and variables

An observation is an individual that we measure or categorise data about

  • An apple
  • A hospital

A value is a singular piece of data about an observation

  • The weight of an apple picked from Tree A1
  • The percent occupancy of beds at Waikato Hospital

A variable is a collection of one type of data collected on all observations

  • The weight for all apples picked from Tree A1
  • The percent occupancy of beds for all hospitals in the Waikato region

More on variables

… we measure or categorise data about

Numeric (Quantitative)
Data that can be described with numbers

  • Discrete (Integer)
    ‐ Number of siblings
    ‐ Age (if measured in years)
  • Continuous (Real)
    ‐ Diastolic & systolic blood pressure
    ‐ Strength of a gravitational wave

Categorical (Qualitative, Factor)
Data that is best described with text

  • Nominal (No order)
    ‐ Te Whatu Ora – Health NZ regions
    ‐ Eye colour
  • Ordinal (Natural order)
    ‐ Number of Wordle guesses
    ‐ Highest educational degree earned

Tidy data

In DATAX121, most, if not all, datasets follow tidy data principles

There are three interrelated rules that make a dataset tidy:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

— Wickham et al. (in press)

CS 1.1: NZ income snapshot in 2011

Synthetic sample data based on real data from the June quarter 2011 NZ Income Survey. The survey was an annual snapshot to produce income statistics on New Zealanders aged 15 and over based on a representative sample of the population.

Variables
ethnicity A factor denoting the ethnicity with 6 levels
region A factor denoting the region of residence
sex A factor denoting the gender, male or female
agegrp A factor denoting the five year age-band. Note that the value 65 describes an individual aged 65 or older
qualification A factor denoting the highest qualification level with 5 levels
occupation A factor denoting the category of the main income source with 10 levels
hours A number denoting the weekly hours worked from all wages and salary jobs excluding self-employment
income A number denoting gross weekly income from all sources ($)
nzis.df <- read.csv("datasets/NZIS-CART-SURF-2011.csv")
nrow(nzis.df) # Prints the number of observations
[1] 29447
colnames(nzis.df) # Prints the variable names
[1] "ethnicity"     "region"        "sex"           "agegrp"       
[5] "qualification" "occupation"    "hours"         "income"       

CS 1.1: NZ income snapshot in 2011

View(nzis.df) # Opens a spreadsheet view of the dataset 

Exploratory data analysis (EDA)

The simple graph (plot) has brought more information to the data analyst’s mind than any other device.

— John Tukey

The goal of EDA is simply to explore our data with visualisations and descriptive statistics

In DATAX121, we cover the basic suite of EDA tools to describe features of the data

One numeric variable

  • What is the general distribution of the data?
  • Where are the values centred?
  • How do the data vary?

Dot plot

stripplot( ~ income, data = nzis.df, 
          xlab = "Gross Weekly Income ($)",
          main = "NZer's gross weekly income snapshot in 2011")

Figure: The gross weekly income of 29447 New Zealanders

Arguments
xlab takes text to label the horizontal (\(x\)) axis
main takes text to label the title of the plot

The variable is plotted as-is on a number line

A function from the lattice R package was used to create this plot

One issue with this plot for CS 1.1 is overplotting—we have values that are plotted on top of each other

The gross weekly income seems to be centred at about $2,500. Most of the data seems to be between -$1,000 and $10,000

Dot plot with jitter

stripplot( ~ income, data = nzis.df, jitter.data = TRUE,
          factor = 10, xlab = "Gross Weekly Income ($)",
          main = "NZer's gross weekly income snapshot in 2011")

Figure: The gross weekly income of 29447 New Zealanders

Arguments
jitter.data takes a value of either TRUE or FALSE
factor takes a number to determine the extent of the jitter

The variable is plotted on a number line and the values are randomly spread along the other axis

Jitter helps avoid overplotting for datasets like CS 1.1. Also, the default extent of the jitter may need to be tweaked per dataset

Jitter also visualises the density of values

The gross weekly income seems to be centred at about $1,500. Most of the data seems to be between -$500 and $5,000. The data is skewed to the right

Box plot

bwplot( ~ income, data = nzis.df, pch = "|",
       xlab = "Gross Weekly Income ($)",
       main = "NZer's gross weekly income snapshot in 2011")

Figure: The gross weekly income of 29447 New Zealanders

Arguments
pch = "|" to plot the median with a line instead of a dot
Add coef = 0 to plot only the “whiskers” without the outliers

The variable is summarised with five descriptive statistics, with outliers (by default), then those features are plotted on a number line

Box plots avoid overplotting by plotting summarised data (and outliers). More on this later!

However, some features of the distribution of the data are hidden

The median gross weekly income seems to be about $500. The central 50% of the data seems to be between $0 and $1,000. The data is clearly right-skewed

Histogram

histogram( ~ income, data = nzis.df, nint = 50, type = "count",
          xlab = "Gross Weekly Income ($)",
          main = "NZer's gross weekly income snapshot in 2011")

Figure: The gross weekly income of 29447 New Zealanders

Arguments
nint takes a number to determine the number of intervals
type takes a value of either "count" or "percent"

The variable is visualised as a set of bars whose widths are equally-sized intervals, but the heights are determined by the number of values within the interval

Histograms avoid the overplotting issue by summarising the frequency of values based on equally-sized intervals

The default number of intervals may need to be tweaked per dataset. Also, histograms may not be suitable for “small” datasets

The gross weekly income seems to be centred at about $1,000. Most of the data seems to be between $0 and $2,000

CS 1.2: Old faithful

Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA

Variables
eruptions A number denoting the eruption time (in minutes)
waiting A number denoting the waiting time to the next eruption (in minutes)
data(faithful) # A dataset that comes with R
nrow(faithful)
[1] 272
head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

Comparing dot plot with jitter & box plot

Figure: The waiting time between eruptions for the Old Faithful geyser in Yellowstone National Park

stripplot( ~ waiting, data = faithful, jitter.data = TRUE,
          factor = 3, xlab = "Waiting time (minutes)",
          main = "Waiting time between eruptions") |>
  print(split = c(1, 1, 1, 2), more = TRUE)
bwplot( ~ waiting, data = faithful, pch = "|", coef = 0, 
       xlab = "Waiting time (minutes)") |>
  print(split = c(1, 2, 1, 2))

Comparing dot plot with jitter & histogram

Figure: The waiting time between eruptions for the Old Faithful geyser in Yellowstone National Park

stripplot( ~ waiting, data = faithful, jitter.data = TRUE,
          factor = 3, xlab = "Waiting time (minutes)",
          main = "Waiting time between eruptions") |>
  print(split = c(1, 1, 1, 2), more = TRUE)
histogram( ~ waiting, data = faithful, nint = 25, 
          xlab = "Waiting time (minutes)") |>
  print(split = c(1, 2, 1, 2))

Comparing box plot & histogram

Figure: The waiting time between eruptions for the Old Faithful geyser in Yellowstone National Park

bwplot( ~ waiting, data = faithful, pch = "|", coef = 0, 
       xlab = "Waiting time (minutes)",
       main = "Waiting time between eruptions") |>
  print(split = c(1, 1, 1, 2), more = TRUE)
histogram( ~ waiting, data = faithful, nint = 25, 
          xlab = "Waiting time (minutes)") |>
  print(split = c(1, 2, 1, 2))

CS 1.3: Block weights

Sample data from a similar woodblock exercise used in the first lecture. The exercise aimed to estimate the average block weight using only a sample of blocks.

Variables
Block.ID An integer between 1–100 denoting the block’s identification number
Weight A number denoting the weight of the block (grams)
blocks.df <- read.csv("datasets/random-blocks.csv")
nrow(blocks.df)
[1] 10
summary(blocks.df)
    Block.ID         Weight     
 Min.   : 6.00   Min.   : 7.80  
 1st Qu.:23.75   1st Qu.:13.85  
 Median :53.50   Median :28.20  
 Mean   :52.80   Mean   :28.68  
 3rd Qu.:83.25   3rd Qu.:37.25  
 Max.   :98.00   Max.   :58.30  
head(blocks.df, n = 10)
   Block.ID Weight
1        46   30.3
2        61    8.1
3        16   33.8
4        90    7.8
5        22   38.4
6         6   52.8
7        98   19.1
8        97   12.1
9        63   58.3
10       29   26.1

Mean

A measure of centre that is often described as the balancing point of the variable

Let \(x_i\) be the \(i\)th value and \(n\) be the total number of observations. Then, the sample mean is

\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum^n_{i=1} x_i}{n} \]

Figure: The weight of the ten blocks, with the mean superimposed

blocks.df$Weight
 [1] 30.3  8.1 33.8  7.8 38.4 52.8 19.1 12.1 58.3 26.1

Sample mean: \(\bar{x} = 28.68\) grams
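As a quick check of the formula, the sample mean can be worked out step by step in R from the ten weights printed above. This is only a minimal sketch: the names weights, n, and x.bar are chosen here for illustration and are not part of the course code.

```r
# The ten block weights (grams), copied from the output of blocks.df$Weight
weights <- c(30.3, 8.1, 33.8, 7.8, 38.4, 52.8, 19.1, 12.1, 58.3, 26.1)

n <- length(weights)         # Total number of observations
x.bar <- sum(weights) / n    # Sum of the values divided by n

x.bar
mean(weights)                # The built-in function should agree with x.bar
```

Both lines should print the same value, because mean() carries out the same sum-and-divide calculation.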

Standard deviation

A measure of spread that is mathematically associated with the mean

Let \(x_i\) be the \(i\)th value, \(\bar{x}\) be the sample mean, and \(n\) be the total number of observations. Then, the sample standard deviation is

\[ \begin{aligned} s &= \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} \\ &= \cdots \\ &= \sqrt{\frac{\sum^n_{i=1} (x_i)^2 - n\bar{x}^2}{n - 1}} \end{aligned} \]

Figure: The weight of the ten blocks, with the mean and standard deviation superimposed

mean(blocks.df$Weight)
[1] 28.68
blocks.df$Weight^2
 [1]  918.09   65.61 1142.44   60.84 1474.56 2787.84  364.81  146.41 3398.89
[10]  681.21

Sample standard deviation: \(s = 17.6863915\) grams
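The two forms of the standard deviation formula can be compared directly in R. The sketch below assumes the ten weights printed earlier; the names s.def and s.short are made up for this illustration.

```r
# The ten block weights (grams), copied from the output of blocks.df$Weight
weights <- c(30.3, 8.1, 33.8, 7.8, 38.4, 52.8, 19.1, 12.1, 58.3, 26.1)

n <- length(weights)
x.bar <- mean(weights)

# Definition: squared deviations from the mean, divided by n - 1
s.def <- sqrt(sum((weights - x.bar)^2) / (n - 1))

# Shortcut: sum of the squared values minus n times the squared mean
s.short <- sqrt((sum(weights^2) - n * x.bar^2) / (n - 1))

c(s.def, s.short, sd(weights))   # All three should agree
```

The shortcut form is the one that matches the mean(blocks.df$Weight) and blocks.df$Weight^2 output above: it needs only the mean and the squared values.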

More on standard deviation

If the data has one mode (unimodal) and is relatively symmetrical, then approximately:

  • 68.3% of all the values are within one \(s\) of \(\bar{x}\)
  • 95.4% of all the values are within two \(s\) of \(\bar{x}\)
  • 99.7% of all the values are within three \(s\) of \(\bar{x}\)
sd(blocks.df$Weight) # Calculate the sample standard deviation
[1] 17.68639

Figure: The weight of the ten blocks, with the mean and standard deviation superimposed
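To see how the percentages above play out on real values, the share of block weights within one and two standard deviations of the mean can be counted in R. This is a rough sketch on the ten weights printed earlier; the names within.one and within.two are illustrative only.

```r
# The ten block weights (grams), copied from the output of blocks.df$Weight
weights <- c(30.3, 8.1, 33.8, 7.8, 38.4, 52.8, 19.1, 12.1, 58.3, 26.1)

x.bar <- mean(weights)
s <- sd(weights)

# Proportion of values no further than one (or two) s from the mean
within.one <- mean(abs(weights - x.bar) <= 1 * s)
within.two <- mean(abs(weights - x.bar) <= 2 * s)

c(within.one, within.two)
```

Here six of the ten weights (0.6) fall within one \(s\) and all ten (1.0) within two \(s\). With only ten observations, a loose match to 68.3% and 95.4% is about as much as can be expected.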

Median

A measure of centre that is often described as the middle value (50th percentile) of the variable

Let \(n\) be the total number of observations. Then, the sample median, \(m\), can be determined by

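  1. Sort the values in ascending (or descending) order
  2. Find the middle of the sorted values:
    • If \(n\) is odd, then \(m\) is the value of the \(\frac{n + 1}{2}\)th observation
    • If \(n\) is even, then \(m\) is the mean of the values from the observations “above” and “below” position \(\frac{n + 1}{2}\), i.e. the \(\frac{n}{2}\)th and \(\left(\frac{n}{2} + 1\right)\)th observations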

Figure: The weight of the ten blocks, with the median superimposed

blocks.df$Weight
 [1] 30.3  8.1 33.8  7.8 38.4 52.8 19.1 12.1 58.3 26.1

Sample median: \(m = 28.2\) grams
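The steps for finding the median can be traced in R on the ten block weights. This is a minimal sketch; the names sorted and m are chosen for illustration.

```r
# The ten block weights (grams), copied from the output of blocks.df$Weight
weights <- c(30.3, 8.1, 33.8, 7.8, 38.4, 52.8, 19.1, 12.1, 58.3, 26.1)

sorted <- sort(weights)   # Step 1: sort in ascending order
n <- length(sorted)

if (n %% 2 == 1) {
  # n is odd: take the single middle value
  m <- sorted[(n + 1) / 2]
} else {
  # n is even: average the two values either side of the middle
  m <- mean(sorted[c(n / 2, n / 2 + 1)])
}

m
median(weights)           # The built-in function should agree with m
```

With \(n = 10\), the two middle values are the 5th and 6th sorted weights, 26.1 and 30.3, whose mean is 28.2.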

Range

A measure of spread that describes the width of the variable

The range is the difference between the largest and smallest values

Figure: The weight of the ten blocks, with the median and range superimposed

summary(blocks.df$Weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.80   13.85   28.20   28.68   37.25   58.30 

Range: 58.30 - 7.80 = 50.5 grams

Interquartile range

A measure of spread that describes the width of the central 50% of the variable

The interquartile range, \(IQR\), can be determined by

  1. Sort the values in ascending (or descending) order
  2. Calculate the lower quartile, 1st Qu., which could be the median of the lower 50% of the data
  3. Calculate the upper quartile, 3rd Qu., which could be the median of the upper 50% of the data
  4. Calculate the difference between the upper and lower quartiles

Figure: The weight of the ten blocks, with the median and interquartile range superimposed

summary(blocks.df$Weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.80   13.85   28.20   28.68   37.25   58.30 

Interquartile range: \(IQR = 37.25 - 13.85 = 23.4\) grams
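The words “could be” in the steps above matter: there is more than one convention for quartiles. The sketch below compares the median-of-each-half approach with R's default, which interpolates between sorted values. The names lower.half, upper.half, iqr.halves, and iqr.default are illustrative.

```r
# The ten block weights (grams), copied from the output of blocks.df$Weight
weights <- c(30.3, 8.1, 33.8, 7.8, 38.4, 52.8, 19.1, 12.1, 58.3, 26.1)

sorted <- sort(weights)
lower.half <- sorted[1:5]     # Lower 50% of the data
upper.half <- sorted[6:10]    # Upper 50% of the data

# Convention 1: quartiles as the medians of each half
iqr.halves <- median(upper.half) - median(lower.half)   # 38.4 - 12.1

# Convention 2: R's default quartiles, as reported by summary()
iqr.default <- IQR(weights)                             # 37.25 - 13.85

c(iqr.halves, iqr.default)
```

For these ten blocks the two conventions give 26.3 and 23.4 grams respectively. Neither is wrong; the differences tend to shrink as the number of observations grows.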

More on interquartile range

Recall that box plots visualise “outliers” by default

The (approximate) rules used by most software and packages are:

  • Values greater than: \(\text{Upper Quartile} + 1.5 \times IQR\)
  • Values less than: \(\text{Lower Quartile} - 1.5 \times IQR\)
summary(nzis.df$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-5100.0   240.5   536.0   692.1   963.0 25443.0 
1.5 * IQR(nzis.df$income) 
[1] 1083.75
bwplot( ~ income, data = nzis.df, pch = "|",
       xlab = "Gross Weekly Income ($)", 
       xlim = c(-2500, 2500),
       main = "NZer's gross weekly income snapshot in 2011")

Figure: The gross weekly income of 29447 New Zealanders with the horizontal (\(x\)) axis truncated between -$2,500 and $2,500

Outlier limits: lower = 240.5 - 1083.75 = -843.25, upper = 963.0 + 1083.75 = 2046.75
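The two limits can be reproduced from the quartiles printed by summary(nzis.df$income). The sketch below types those quartiles in by hand rather than reading the dataset; the names lower.limit and upper.limit are made up for this illustration.

```r
# Quartiles of income, copied from the summary(nzis.df$income) output
lower.quartile <- 240.5
upper.quartile <- 963.0

iqr <- upper.quartile - lower.quartile   # Width of the central 50%
step <- 1.5 * iqr                        # Matches 1.5 * IQR(nzis.df$income)

lower.limit <- lower.quartile - step     # Values below this are flagged
upper.limit <- upper.quartile + step     # Values above this are flagged

c(lower.limit, upper.limit)
```

As the slide notes, these rules are approximate: plotting functions may compute the quartiles slightly differently, so the limits drawn on a box plot can differ a little from these values.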

Reference: Terminology for one numeric variable

Centre
The “typical size” of the data, e.g. the sample mean & median

Spread
The “variability” of the data, e.g. the sample standard deviation, range, & interquartile range

Outliers
An observation whose value is notably distinct from other values in the data

Cluster
A distinct group of observations—see CS 1.2

Shape (Distribution)
The form of the data, e.g. U-shaped or bell-shaped

Mode (Distribution)
The “frequent value(s)” of the data, e.g. the peaks of a histogram

Symmetrical (Distribution)
A distribution where the two sides approximately match when folded on a vertical centre line

Skewed to the left (Distribution)
A distribution where the data piles up on the right and the tail extends relatively far out to the left

Skewed to the right (Distribution)
A distribution where the data piles up on the left and the tail extends relatively far out to the right

Two numeric variables

  • Describing the relationship between two numeric variables

CS 1.4: Snapper in the gulf

Weight and length measures of 844 snapper, Pagrus auratus, caught in the Hauraki Gulf, near Auckland, New Zealand.

Variables
len A number denoting the fork length of the fish (centimetres)
wgt A number denoting the weight of the fish (kilograms)
snapper.df <- read.csv("datasets/snapper.csv")
nrow(snapper.df)
[1] 844
summary(snapper.df)
      len             wgt        
 Min.   :25.10   Min.   : 0.330  
 1st Qu.:38.10   1st Qu.: 1.150  
 Median :44.80   Median : 1.820  
 Mean   :46.25   Mean   : 2.298  
 3rd Qu.:53.40   3rd Qu.: 2.920  
 Max.   :86.10   Max.   :12.440  
head(snapper.df)
   len  wgt
1 44.8 1.92
2 42.1 1.37
3 41.3 1.42
4 50.6 2.55
5 42.8 1.54
6 55.1 3.36

Scatter plot

xyplot(len ~ wgt, data = snapper.df,
       main = "Scatter plot of snapper fork length vs weight",
       xlab = "Weight (kg)", ylab = "Fork length (cm)")

Figure: The fork length and weight of 844 snapper

Arguments
ylab takes text to label the vertical (\(y\)) axis

The location of each observation is determined by the value of the two visualised variables

Scatter plots help us describe the relationship between two variables

A simple description addresses the direction (positive or negative) and type of relationship (linear, non-linear, or “none”)

There is a positive non-linear relationship between the fork length and weight of snapper

Is it linear or non-linear?

xyplot(len ~ wgt, data = snapper.df, 
       type = c("p", "r"), col.line = "black", lwd = 2,
       main = "Scatter plot of snapper fork length vs weight",
       xlab = "Weight (kg)", ylab = "Fork length (cm)")

Figure: The fork length and weight of 844 snapper with the best-fit line superimposed

The extra arguments, type = c("p", "r"), col.line = "black", and lwd = 2, add a distinct best-fit line to the scatter plot

The best-fit line is a useful aid to help determine if the relationship between two numeric variables is linear

Correlation

A measure of the strength and direction of a linear association between two numeric variables

cor.test( ~ len + wgt, data = snapper.df)$estimate
      cor 
0.9513067 

If it was appropriate for the snapper data, then \(r=0.95\) (2 dp)

\(r\) helps us describe the strength of a linear association… Why not the strength of a linear relationship?

cor.test( ~ wgt + len, data = snapper.df)$estimate
      cor 
0.9513067 

Values of \(r\) close to \(+1\) or \(-1\) show a strong linear association, while values of \(r\) close to \(0\) show no linear association

Why do we only use r for linear relationships?
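One way to explore this question is with a small made-up example (not from the course data) in which \(y\) is completely determined by \(x\), but not along a straight line. The values of x and y below are hypothetical.

```r
# Hypothetical data: y depends perfectly on x, but through a U-shaped curve
x <- -3:3
y <- x^2

cor(x, y)   # Correlation between x and y
```

Here \(r\) is 0 even though knowing \(x\) tells us \(y\) exactly, so a value of \(r\) near 0 only rules out a linear association, not a relationship in general.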

Equation Reference: Correlation

\[ r = \frac{\sum_{i=1}^n (x_i \cdot y_i) - n \cdot \bar{x} \cdot \bar{y}}{(n - 1) \cdot s_x \cdot s_y}. \]

Let:

  • \(n\) be the total number of observations
  • \(x_i\) and \(y_i\) be the \(i\)th observation's values for the \(x\) and \(y\) variables
  • \(\bar{x}\) and \(\bar{y}\) be the sample means of the \(x\) and \(y\) variables
  • \(s_x\) and \(s_y\) be the sample standard deviations of the \(x\) and \(y\) variables
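The equation can be checked against R's built-in cor() function. The sketch below uses the faithful dataset from CS 1.2 because it ships with R; the name r.formula is chosen for illustration.

```r
# Old Faithful data (CS 1.2), which comes with R
x <- faithful$eruptions
y <- faithful$waiting
n <- length(x)

# Correlation from the equation: sum of products minus n times both means,
# divided by (n - 1) times both sample standard deviations
r.formula <- (sum(x * y) - n * mean(x) * mean(y)) /
  ((n - 1) * sd(x) * sd(y))

r.formula
cor(x, y)   # The built-in function should agree with r.formula
```

Swapping x and y leaves every term of the equation unchanged, which is why cor.test() gave the same estimate for both orderings of len and wgt.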

One categorical variable

  • What is the general distribution of the data?

CS 1.5: Wordle

A snapshot of Wordle guess distributions from David and his Wordle-obsessed friends.

Variables
Count An integer denoting the frequency of Guesses
Initials A factor denoting whose Wordle guess distribution it is with 5 levels
Guesses A factor denoting how many guesses it took to complete the daily Wordle (as you lose if your 6th guess is incorrect) with 7 levels
wordle.df <- read.csv("datasets/wordle.csv")
nrow(wordle.df)
[1] 35
summary(wordle.df)
     Count          Initials           Guesses         
 Min.   :  0.00   Length:35          Length:35         
 1st Qu.:  1.00   Class :character   Class :character  
 Median : 10.00   Mode  :character   Mode  :character  
 Mean   : 32.49                                        
 3rd Qu.: 43.50                                        
 Max.   :175.00                                        
head(wordle.df, n = 7)
  Count Initials Guesses
1     0     D.C.       1
2    15     D.C.       2
3    81     D.C.       3
4   110     D.C.       4
5    87     D.C.       5
6    33     D.C.       6
7    16     D.C.    Lost

Bar plot of counts

xtabs(Count ~ Guesses, data = wordle.df) |> 
  as.data.frame() |>
  barchart(Freq ~ Guesses, data = _, origin = 0, xlab = "Guesses", 
           ylab = "Counts", main = "Wordle guess distribution")

Figure: The Wordle guess distribution for David and friends

Arguments
origin = 0 ensures that the bars start at 0 (!?!)

The variable is visualised as a set of bars, one for each level, and the height of each bar is the frequency of the level

The term frequency is used to describe the number of observations with that specific level (category)

Producing a bar plot of counts with tidy data involves a bit more R code for a sensible bar plot

The most frequent number of guesses required for a Wordle game was four guesses followed by five guesses

Bar plot of proportions

xtabs(Count ~ Guesses, data = wordle.df) |> 
  proportions() |> # Converts the frequencies into proportions
  as.data.frame() |>
  barchart(Freq ~ Guesses, data = _, origin = 0, xlab = "Guesses", 
           ylab = "Proportions", main = "Wordle guess distribution")

Figure: The Wordle guess distribution for David and friends

Arguments
The addition of the proportions() line tells R to do the necessary proportion calculations

The variable is visualised as a set of bars, one for each level, and the height of each bar is the proportion of all observations with that level

Bar plots of counts and of proportions have the same shape for a single categorical variable; only the scale of the vertical axis differs

However, the frequencies and \(n\) that were used to calculate the proportions are now hidden

The benefit of visualising proportions instead of counts is more evident for two categorical variables

The most frequent number of guesses required for a Wordle game was four guesses followed by five guesses \((n = 1137)\)
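The value of \(n\) and the proportions behind the bars can be recovered from the frequencies that xtabs(Count ~ Guesses, data = wordle.df) returns. The sketch below types those frequencies in by hand; the names counts and props are illustrative.

```r
# Frequencies for each level of Guesses, copied from the xtabs() output
counts <- c("1" = 1, "2" = 38, "3" = 206, "4" = 396,
            "5" = 359, "6" = 107, "Lost" = 30)

n <- sum(counts)             # Total number of Wordle games
props <- counts / n          # Proportion of games for each level

n
names(which.max(counts))     # Level with the highest frequency
round(props, 3)
```

The total is \(n = 1137\) and the most frequent level is "4", matching the description above.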

CS 1.6: ICU admissions

Data from a sample of 200 patients following admission to an adult intensive care unit (ICU) in the United States of America.

Variables
Status A factor denoting whether the patient lived or died
Sex A factor denoting the patient’s sex, male or female
Race A factor denoting the patient’s race, white, black or other
Infection A factor denoting whether an infection was involved, yes or no
Previous A factor denoting whether the patient has been admitted to ICU within the last 6 months
Type A factor denoting the type of ICU admission, elective or emergency
Fracture A factor denoting whether a fractured bone was involved, yes or no
icu.df <- read.csv("datasets/ICU.csv")
nrow(icu.df)
[1] 200
summary(icu.df)
    Status              Sex                Race            Infection        
 Length:200         Length:200         Length:200         Length:200        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
   Previous             Type             Fracture        
 Length:200         Length:200         Length:200        
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
head(icu.df)
  Status    Sex  Race Infection Previous      Type Fracture
1  lived female white       yes       no emergency       no
2  lived   male white        no      yes emergency       no
3  lived   male white        no       no  elective       no
4  lived   male white       yes       no emergency      yes
5  lived female white       yes      yes emergency       no
6  lived   male white       yes       no emergency       no

Frequency tables

The bar plots described prior are visualised from frequency tables

In R, the construction of frequency tables… depends on the data!

If the data already has a numeric variable for the counts of each level of the categorical variable, e.g. CS 1.5:

xtabs(Count ~ Guesses, data = wordle.df)
Guesses
   1    2    3    4    5    6 Lost 
   1   38  206  396  359  107   30 

If the data simply has a categorical variable describing the level for each observation, e.g. CS 1.6:

xtabs( ~ Sex, data = icu.df)
Sex
female   male 
    76    124 

Proportion

A measure that describes the number of observations within a level of a categorical variable as a number between \(0\) and \(1\) (inclusive)

Let \(n\) be the total number of observations. Then, the sample proportion for some category is

\[ \widehat{p} = \frac{\text{Number in that level}}{n} \]

xtabs( ~ Sex, data = icu.df) |>
  addmargins() # Calculate n as part of the frequency table
Sex
female   male    Sum 
    76    124    200 

Sample proportions: female = 76/200 = 0.38, male = 124/200 = 0.62

Percentage…?

Proportions as defined in the previous slide are interchangeable with percentages

Note that a percentage must be written in % (percent) units

Let \(\widehat{p}\) be the sample proportion for some category. Then, the corresponding percentage is

\[ \text{Percentage} = \left(100 \times \widehat{p}\right)\!\% \]

xtabs( ~ Sex, data = icu.df) |>
  proportions()
Sex
female   male 
  0.38   0.62 

Percentages: female = 38%, male = 62%
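The path from frequencies to proportions to percentages can be followed by hand in R. This sketch enters the Sex frequencies from CS 1.6 directly; the names counts, p.hat, and percentage are chosen for illustration.

```r
# Frequencies of Sex, copied from the xtabs( ~ Sex, data = icu.df) output
counts <- c(female = 76, male = 124)

n <- sum(counts)             # Total number of observations
p.hat <- counts / n          # Sample proportions, between 0 and 1
percentage <- 100 * p.hat    # The same information in percent units

p.hat
percentage
```

The percentages carry no extra information; they are the proportions rescaled by 100.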

Reference: Bar plot of percentages

xtabs(Count ~ Guesses, data = wordle.df) |> 
  proportions() |> # Converts the frequencies into proportions
  as.data.frame() |>
  barchart(100 * Freq ~ Guesses, data = _, origin = 0, xlab = "Guesses", 
           ylab = "Percentage (%)", main = "Wordle guess distribution")

Figure: The Wordle guess distribution for David and friends

Two categorical variables

  • Describing the relationship between two categorical variables

Side-by-side bar plots of counts

xtabs( ~ Status + Race, data = icu.df) |>
  as.data.frame() |>
  barchart(Freq ~ Status, groups = Race, data = _, origin = 0,
           main = "Status distribution by Race", 
           xlab = "Status", ylab = "Count",
           auto.key = list(title = "Race", space = "right"))

Figure: The ICU patient distribution of Status by Race

The two variables are visualised as a set of bars, one for each level combination, and the height of each bar is the frequency of the level combination

Describing relationships between two categorical variables is tricky

The frequencies of each level combination can lead to inappropriate conclusions if the levels of the “by” categorical variable do not have similar frequencies

Arguments
auto.key = list(title = "Race", space = "right") includes a legend on the right-hand side of the plot titled “Race”

Side-by-side bar plots of proportions by level

xtabs( ~ Status + Race, data = icu.df) |>
  proportions("Race") |>
  as.data.frame() |>
  barchart(Freq ~ Status, groups = Race, data = _, origin = 0, 
           main = "Status distribution by Race",
           xlab = "Status", ylab = "Proportion",
           auto.key = list(title = "Race", space = "right"))

Figure: The ICU patient distribution of Status by Race

The two variables are visualised as a set of bars, one for each level combination, and the height of each bar is the proportion within a level of a variable given the level of another variable

Proportions stop us from reading in a relationship that looks pronounced only because of unequal frequencies

Note that the proportions from the same coloured bars sum to one, which is why the “X distribution by Y” phrase is used in the plot’s title

It seems that a higher proportion of “black” patients who were admitted into ICU lived compared to “other” and “white” patients

Arguments
proportions("Race") tells R to calculate conditional proportions of Status for each level of Race
groups = Race splits the bars, side-by-side, for Status by the levels of Race

Two-way table

The bar plots described in this section are visualised from two-way tables

In R, the construction of two-way tables also depends on the data!

If the data already has a numeric variable for the counts of each level combination of the two categorical variables, e.g. CS 1.5:

xtabs(Count ~ Guesses + Initials, data = wordle.df)
       Initials
Guesses D.C. E.K. J.C. J.N. K.C.
   1       0    0    0    1    0
   2      15    0    2    6   15
   3      81    0   10   23   92
   4     110    2   23   86  175
   5      87    1   18  165   88
   6      33    1    5   54   14
   Lost   16    1    1    9    3

If the data has two categorical variables describing the respective levels for each observation, e.g. CS 1.6:

xtabs( ~ Sex + Race, data = icu.df)
        Race
Sex      black other white
  female     5     4    67
  male      10     6   108

Proportion for a given level

A measure that describes the number of observations within a level of a categorical variable given the level of another categorical variable as a number between \(0\) and \(1\) (inclusive)

A proportion calculated this way can also be interpreted as a percentage

Let \(n_\bullet\) be the total number of observations for a category level. Then, the sample proportion for some other category given the category level is

\[ \widehat{p}_\bullet = \frac{\text{Number in both levels}}{n_\bullet} \]

xtabs( ~ Race + Sex, data = icu.df) |>
  addmargins() # Calculate the n.dots as part of the
                # two-way table
       Sex
Race    female male Sum
  black      5   10  15
  other      4    6  10
  white     67  108 175
  Sum       76  124 200

Sample proportions of Sex given Race (female, male): black 0.3333333, 0.6666667; other 0.4, 0.6; white 0.3828571, 0.6171429
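These proportions can be rebuilt from the two-way table by dividing each row by its own total. The sketch below enters the Race by Sex frequencies by hand as a matrix; the names tab, n.dot, and p.hat are illustrative.

```r
# Two-way table of Race (rows) by Sex (columns), copied from the output above
tab <- matrix(c( 5,  10,
                 4,   6,
                67, 108),
              nrow = 3, byrow = TRUE,
              dimnames = list(Race = c("black", "other", "white"),
                              Sex = c("female", "male")))

n.dot <- rowSums(tab)   # Total number of observations for each level of Race
p.hat <- tab / n.dot    # Each row divided by its own total

p.hat
rowSums(p.hat)          # Each row of proportions sums to one
```

Dividing by the row totals is what proportions(tab, margin = 1) does as well; the "Race" argument used in the bar plot code asks for the same calculation by naming the variable to condition on.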

Reference: Side-by-side bar plots of percentages by level

xtabs( ~ Sex + Race, data = icu.df) |>
  proportions("Race") |>
  as.data.frame() |>
  barchart(100 * Freq ~ Sex, groups = Race, data = _, origin = 0,
           main = "Sex distribution by Race",
           xlab = "Sex", ylab = "Percentage (%)",
           auto.key = list(title = "Race", space = "right"))

Figure: The ICU patient distribution of Sex by Race

Reference: Proportions versus probabilities

The concepts of a proportion and a probability are quite distinct. A proportion is a partial description of a real population—a form of summary. Probabilities tell us about the chances of something happening in a random experiment. The fact that proportions are numerically identical to probabilities for a real population under the experiment “choose a unit at random,” however, means that we can use the probability notation and any formulas derived for manipulating probabilities to solve problems involving proportions as well.

— Wild & Seber (2000)